Building Topic Models Based on Anchor Words

Authors

  • Ankur Moitra
  • Tim Roughgarden
  • Catalin Voss
Abstract

Suppose you were given a stack of documents, such as all of the articles published in a particular newspaper, and your goal was to make sense of this data by determining the topics it is made up of. To frame this as an unsupervised learning problem, suppose the documents were written in a foreign language and came from a foreign planet. By understanding the topics these documents are about, you would be able to determine, given a new document, what characteristics it shares with other documents and, ultimately, what it is about. In theory, we refer to this as the problem of unsupervised Topic Modeling, first introduced by David Blei et al. Of course, instead of news articles, the problem can be framed with genome sequences, audio tracks, images, and all sorts of other data. Informally, Topic Modeling aims to discover hidden topics in documents and then annotate the documents according to these topics in order to summarize the collection. This falls into the modern AI challenge of developing tools for automatic data comprehension. In the 2012 paper Learning Topic Models: Going beyond SVD [1], Sanjeev Arora, Rong Ge, and Ankur Moitra present a new method for unsupervised learning of topics, based on Nonnegative Matrix Factorization (NMF), and provide provable bounds for the error in learning. Arora et al. motivate NMF as a more naturally derived tool for topic learning than Singular Value Decomposition (SVD), the approach currently prevailing in theory. The authors present a polynomial-time algorithm, building on their previous study of NMF [3], that, similar to SVD, can be realized mostly in linear algebra operations and thus achieves better running time than other local-search approaches both in theory and in practice, where the number of documents suffices to be

Similar resources

Is Your Anchor Going Up or Down? Fast and Accurate Supervised Topic Models

Topic models provide insights into document collections, and their supervised extensions also capture associated document-level metadata such as sentiment. However, inferring such models from data is often slow and cannot scale to big data. We build upon the “anchor” method for learning topic models to capture the relationship between metadata and latent topics by extending the vector-space rep...

Tandem Anchoring: a Multiword Anchor Approach for Interactive Topic Modeling

Interactive topic models are powerful tools for understanding large collections of text. However, existing sampling-based interactive topic modeling approaches scale poorly to large data sets. Anchor methods, which use a single word to uniquely identify a topic, offer the speed needed for interactive work but lack both a mechanism to inject prior knowledge and lack the intuitive semantics neede...

A Probabilistic Topic Model Based on Local Word Relationships in Overlapping Windows

A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process, extracting the topics from the given documents. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...

Anchor-Free Correlated Topic Modeling: Identifiability and Algorithm

In topic modeling, many algorithms that guarantee identifiability of the topics have been developed under the premise that there exist anchor words – i.e., words that only appear (with positive probability) in one topic. Follow-up work has resorted to three or higher-order statistics of the data corpus to relax the anchor word assumption. Reliable estimates of higher-order statistics are hard t...

Bigram Anchor Words Topic Model

A probabilistic topic model is a modern statistical tool for document collection analysis that allows extracting a number of topics in the collection and describes each document as a discrete probability distribution over topics. Classical approaches to statistical topic modeling can be quite effective in various tasks, but the generated topics may be too similar to each other or poorly interpr...

Low-dimensional Embeddings for Interpretable Anchor-based Topic Inference

The anchor words algorithm performs provably efficient topic model inference by finding an approximate convex hull in a high-dimensional word co-occurrence space. However, the existing greedy algorithm often selects poor anchor words, reducing topic quality and interpretability. Rather than finding an approximate convex hull in a high-dimensional space, we propose to find an exact convex hull i...

Publication date: 2014